Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache #2702

Merged

Conversation

zpinto
Copy link
Contributor

@zpinto zpinto commented Nov 21, 2023

Issues

We found that there was significant shuffling occurring when attempting to perform a Node Swap operation on WAGED clusters. #2662

Description

After investigating, it was determined that the cause was due to the PartitionMovementConstraint and BaselineInfluenceConstraint soft constraints improperly scoring AssignableReplicas during the completion of a SWAP.

This was because those two soft-constraints were using InstanceName instead of LogicaId to determine if the SWAP_IN node was in the baseline or best possible state. It wouldn't be since the last time that the algorithm produced the baseline and best possible state, it contained the SWAP_OUT node instead. This causes the score to be lower and leads to incorrect assignment.

To fix this, those soft-constraints now look for LogicalId and the baseline and best possible state are passed in with instanceName replaced with logicalId.

Also, we are putting all of the logic for what instance's are assignable in one place, the BaseControllerDataProvider. We will also add Evacuation here in the future.

Tests

The previous integration tests have been modified to cover the changes.

We have also tested this logic on a production cluster copied to a testing environment.

Changes that Break Backward Compatibility (Optional)

NA

Despite adding new getters to BaseControllerDataProvider, the old signatures are kept with the same behavour.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Copy link
Contributor

@xyuanlu xyuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change, One general question: for spectator and EV, I think there are places where we still use getInstanceConfigMap. Will this include both swap in/out instances?

@zpinto zpinto marked this pull request as ready for review November 22, 2023 16:48
Copy link
Contributor

@xyuanlu xyuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question:
When we throttle rebalance with ERROR state replica, we count all replicas on swap-in and out. Do we need to ignore replicas on swap-in instance?
In IntermediateStateCalcState

@@ -239,7 +290,8 @@ private boolean validateOfflineInstancesLimit(final ResourceControllerDataProvid
final HelixManager manager) {
int maxOfflineInstancesAllowed = cache.getClusterConfig().getMaxOfflineInstancesAllowed();
if (maxOfflineInstancesAllowed >= 0) {
int offlineCount = cache.getAllInstances().size() - cache.getEnabledLiveInstances().size();
int offlineCount =
cache.getAssignableInstances().size() - cache.getAssignableEnabledLiveInstances().size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here might be tricky for Evacuate. For swap, because swap-in instances are mirror, so it is not considered in offline limit for EMM. But Evacuate will reduce capacity, we may want to separate these 2 cases.
Not related to this change though.

@zpinto
Copy link
Contributor Author

zpinto commented Nov 22, 2023

Failed tests on previous execution failed due to known flaky tests:

Test failed: testGetAllInstances(org.apache.helix.rest.server.TestInstancesAccessor) #2667 (there is PR for fix on this one)
Test failed: testCacheDataUpdates(org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache) #2693

I was able to pass both of these tests locally.

@zpinto
Copy link
Contributor Author

zpinto commented Nov 22, 2023

Thanks for making the change, One general question: for spectator and EV, I think there are places where we still use getInstanceConfigMap. Will this include both swap in/out instances?

Will be addressing this in a future PR for spectator.

@zpinto
Copy link
Contributor Author

zpinto commented Nov 22, 2023

Will add comment with TODO to address 2->3 ST messages not being sent and throttled due to more messages being sent to SWAP_IN and counting against throttles.

This is also a known issue already and we should figure out a way to prioritize ST messages.

@xyuanlu
Copy link
Contributor

xyuanlu commented Nov 22, 2023

Had an offline review session and discussed several points. Generally looking good. Thanks for working on this!

…out hosts it in top or second top state, add throttling todos, and move assignable maps in cache out of propert cache block.
…eLogicalIds with stream instead of parallelStream to prevent concurrent modification exception.
… the same cluster as TestPerInstanceAccessor which adds an evacuate instance to the cluster. Any time TestPerInstanceAccessor runs first, it will cause TestInstanceAccessor.getAllInstances to fail. Using a new cluster for TestInstanceAccessor fixes the issue.
@zpinto
Copy link
Contributor Author

zpinto commented Nov 28, 2023

Was able to fix TestInstanceAccessor.getAllInstances test.

Test failed: testCacheDataUpdates(org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache) #2693 which is now fixed by #2705

Assuming that review looks good, this PR is ready to be merged.

Final Commit Message:
Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.

@zpinto
Copy link
Contributor Author

zpinto commented Nov 28, 2023

Thank you so much for the review @xyuanlu!

This PR is ready to be merged.

Final Commit Message:
Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.

@xyuanlu xyuanlu merged commit 310b946 into apache:ApplicationClusterManager Nov 28, 2023
1 of 2 checks passed
asfgit pushed a commit that referenced this pull request Dec 8, 2023
…e picking assignable instances in the cache. (#2702)

Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.
asfgit pushed a commit that referenced this pull request Dec 13, 2023
…e picking assignable instances in the cache. (#2702)

Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.
xyuanlu pushed a commit that referenced this pull request Dec 20, 2023
…e picking assignable instances in the cache. (#2702)

Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants